Scientific Data — Latest Matching Preprints

1

FennoTraits: Dataset of plant functional traits and community composition in northern European flora

Niittynen, P.; Kemppinen, J.

2026-04-09 plant biology 10.64898/2026.04.07.716889 medRxiv

Top 0.1%

58.3%

We present FennoTraits, a dataset of plant functional traits and community composition collected across Fennoscandia (northern Finland, Norway, and Sweden) in 2016-2025. The dataset contains 42 049 abundance estimations and 155 794 functional trait observations of 10 traits, representing 373 vascular plant species from 1 235 study sites within seven study areas. The trait measurements consist of size-structural, leaf economic, leaf spectral, and reproductive traits. The species represent the majority of the native vascular plant species that occur at the seven study areas, and many of the species occur in all seven areas across the two biomes and their ecotone: tundra and boreal forests. Each study area has distinct characteristics and a range of habitats: tundra, meadows, wetlands, shrublands, and boreal forests. These areas are under low anthropogenic influence, and many of the sites are within protected areas that are reserved for nature conservation and scientific research. Finally, we provide a general description of the main trait patterns and profiles of the northern European flora alongside the dataset.

2

The OS-Prey (Omnibus Study of Prey) database: A compilation of diet records for birds of prey.

Uiterwaal, S. F.; La Sorte, F. A.; Coblentz, K. E.; DeLong, J. P.

2026-03-31 ecology 10.64898/2026.03.28.714998 medRxiv

Top 0.1%

27.9%

Motivation: The diet composition of a predator is a direct reflection of its role in a food web, resulting from interactions with prey species. Raptors (including hawks, owls, and falcons) are ubiquitous predators with diverse diets, yet there is no comprehensive database of raptor diet composition. We present a database of over 3500 raw raptor diet records, compiled from more than 1000 studies and representing 173 raptor species from across the world. Our dataset complements existing qualitative summaries of species diets by compiling thousands of quantitative diet "samples" over time and space to present diet data at a uniquely fine resolution. Main types of variable contained: The database comprises published records of raptor diets from pellets, prey remains, direct or photographic observations, prey DNA, and raptor gut or gullet contents. For each diet, we present the taxonomic identity and amounts of consumed prey. We additionally present various metadata for each diet such as location, habitat, and season. Spatial location and grain: The study incorporates diet records collected worldwide, with each record assigned geographic coordinates corresponding to the location where the diet information was obtained. Time period and grain: The database includes diet records from 1893 to 2025. We report a year for each diet record. Major taxa and level of measurement: We recorded raptor diet at the species level, including raptors from three orders: Strigiformes, Falconiformes and Accipitriformes excluding vultures. Most prey are identified to species, but prey taxonomic level varies depending on the extent to which they could be identified. Software format: Diet records and metadata are provided in two files with comma-separated value (.csv) format.

3

Metabolic fingerprinting of 17 Brassicaceae species across three tissues

Wolters, F. C.; Woldu Semere, T.; Schranz, M. E.; Medema, M. H.; Bouwmeester, K.; van der Hooft, J. J. J.

2026-04-21 plant biology 10.64898/2026.04.17.719198 medRxiv

Top 0.1%

27.5%

Plants produce the most diverse blends of specialized metabolites on earth. Natural products derived from plants are valuable resources for drug development, food chemistry, and crop resistance breeding. Phenotypes of specialized metabolite profiles can be captured by untargeted mass spectrometry across species phylogeny, tissues, and genotypes. Here, we collected metabolic fingerprints of 17 Brassicaceae species across three tissues (paired leaf and root; flower) using liquid chromatography-tandem mass spectrometry (LC-MS/MS) in positive and negative ionization mode. Corresponding metadata has been refined for reuse according to ReDU guidelines, and for integration with public genomic and transcriptomic data. Standardization of in vitro growth conditions and data processing workflows enables integration of acquired raw and processed data across platforms for single- and multi-omics analysis. Further, the inclusion of tissue-specific metabolic profiles across ploidy levels, as well as across crop species and wild relatives, makes this dataset a valuable resource for natural product discovery.

4

QNPtoVox: A methods pipeline for mapping 2D quantitative neuropathology to 3D MNI voxel space.

Madan, R.; Crane, P. K.; Gennari, J. H.; Latimer, C. S.; Choi, S.-E.; Grabowski, T. J.; Mac Donald, C. L.; Hunt, D.; Postupna, N.; Bajwa, T.; Webster, J.

2026-04-21 neuroscience 10.64898/2026.04.17.719274 medRxiv

Top 0.1%

22.2%

Quantitative neuropathology has advanced through whole-slide imaging and digital histology platforms. Yet, these measurements rarely align with neuroimaging coordinate frameworks that may be useful for spatial modeling and other applications. QNPtoVox, short for quantitative neuropathology to voxels, is a reproducible, modular pipeline that transforms quantitative metrics generated by digital pathology software (HALO) into voxel-based maps registered to a standard common coordinate (MNI) template. The workflow integrates digital histopathology, gross tissue photography, ex-vivo MRI, and nonlinear registration to generate spatially standardized 3D pathology representations. This Methods article provides a complete procedural description, including required materials, step-wise instructions, operator-dependent checkpoints, expected outputs, reproducibility evaluation, and troubleshooting. QNPtoVox enables voxel-level integration of neuropathology with neuroimaging tools, unlocking existing histopathology datasets for computational modeling and cross-cohort harmonization.

5

Automated Extraction and Meta-Analysis of a Century of Motor-Unit Research with NeuromechaniX

Del Vecchio, A.; Enoka, R. M.

2026-04-10 physiology 10.64898/2026.04.08.717204 medRxiv

Top 0.1%

18.6%

The scientific literature on human motor units and electromyography (EMG) spans over a century (1925-2025), comprising research impossible to synthesize manually. We introduce NeuromechaniX, a domain-specific platform for automated extraction and meta-analysis of this literature. The core component, MUscraper, is a large language model pipeline that extracts approximately 200 structured metadata fields, organized into 17 major sections spanning participant demographics, EMG acquisition parameters, muscle identification, task protocols, decomposition methods, and motor-unit outcomes, from ~2,000 publications on human limb muscles. This automated extraction transforms heterogeneous narrative reports into a standardized, queryable database at a scale not achievable through manual review. From this dataset, we analyzed motor-unit discharge rate across 208 studies examining seven muscles. Our analyses reveal that discharge rates differ significantly among muscles (p<0.001), with biceps brachii exhibiting the highest rates (15.9 pps), followed by first dorsal interosseous (13.7 pps) and tibialis anterior (13.5 pps), whereas gastrocnemius (11.3 pps), the vastii muscles (11.5 pps) and soleus show the lowest rates (9.9 pps). Sex-stratified analysis shows females exhibit higher discharge rates than males (14.5 vs 11.9 pps; Cohen's d=0.38, p=0.018). In contrast, age-stratified analysis reveals non-significant differences between young and older adults (d=-0.24, p=0.072). Collectively, these results show that current views of human motor units are limited to a few muscles, with little data on females and older adults. The complete structured database is available through an open-access interactive platform (https://neuro-mechanix.com/metadata), enabling researchers to explore, filter, and download the extracted metadata.
NeuromechaniX provides infrastructure for large-scale meta-research, identification of literature gaps, and hypothesis generation for the neuromechanics community.
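
The abstract above reports effect sizes as Cohen's d. As a rough illustration of what that statistic measures, here is a minimal sketch using the common pooled-standard-deviation form. The abstract does not say which variant the authors used, and the function name and sample values below are hypothetical, not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical discharge rates in pps, for illustration only.
group_a = [14.0, 15.0, 16.0]
group_b = [11.0, 12.0, 13.0]
print(cohens_d(group_a, group_b))
```

A positive value means the first group has the higher mean; the magnitude is expressed in units of the pooled standard deviation.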

6

BioDCASE: Using data challenges to make community advances in computational bioacoustics

Stowell, D.; Nolasco, I.; McEwen, B.; Vidana Vila, E.; Jean-Labadye, L.; Benhamadi, Y.; Lostanlen, V.; Dubus, G.; Hoffman, B.; Linhart, P.; Morandi, I.; Cazau, D.; White, E.; White, P.; Miller, B.; Nguyen Hong Duc, P.; Schall, E.; Parcerisas, C.; Gros-Martial, A.; Moummad, I.

2026-04-06 animal behavior and cognition 10.64898/2026.04.02.716062 medRxiv

Top 0.1%

17.5%

Computational bioacoustics has seen significant advances in recent decades. However, the rate of insights from automated analysis of bioacoustic audio lags behind our rate of collecting the data - due to key capacity constraints in data annotation and bioacoustic algorithm development. Gaps in analysis methodology persist: not because they are intractable, but because of resource limitations in the bioacoustics community. To bridge these gaps, we advocate the open science method of data challenges, structured as public contests. We conducted a bioacoustics data challenge named BioDCASE, within the format of an existing event (DCASE). In this work we report on the procedures needed to select and then conduct useful bioacoustics data challenges. We consider aspects of task design such as dataset curation, annotation, and evaluation metrics. We report the three tasks included in BioDCASE 2025 and the resulting progress made. Based on this we make recommendations for open community initiatives in computational bioacoustics.

7

BrainPET Studio: An Atlas-Based, User-Friendly Desktop Tool for Quantitative PET Neuroimaging Analysis

Nabizadeh, F.

2026-04-13 bioinformatics 10.64898/2026.04.09.717450 medRxiv

Top 0.1%

15.0%

Quantitative analysis of positron emission tomography (PET) neuroimaging data is essential for studying neurodegenerative diseases, yet existing processing pipelines often rely on computationally intensive software packages such as FreeSurfer, limiting accessibility for many research groups. Here I introduce BrainPET Studio, an open-source desktop application for atlas-based regional PET quantification that operates entirely in Montreal Neurological Institute (MNI) standard space. BrainPET Studio integrates affine registration, optional Muller-Gartner (MG) partial volume correction (PVC), interactive quality control (QC), and standardized uptake value ratio (SUVR) calculation into a single graphical user interface (GUI), eliminating the requirement for FreeSurfer-based cortical reconstruction. I validated BrainPET Studio against two established pipelines: (1) the UC Berkeley Alzheimer's Disease Neuroimaging Initiative (ADNI) AV1451 (flortaucipir) pipeline, which employs FreeSurfer v7.1.1 parcellation, SPM-based coregistration, and Geometric Transfer Matrix (GTM) PVC in native subject space; and (2) the volBrain/petBrain online platform. Region-of-interest (ROI) SUVR values were compared across 322 subjects. Overall Pearson correlation coefficients for meta-ROI composites ranged from r = 0.83-0.96 versus ADNI and r = 0.86-0.94 versus volBrain/petBrain. Detailed per-subject validation on four representative cases across 112 FreeSurfer-defined regions demonstrated strong agreement for large cortical composites and acceptable variability for smaller medial temporal structures. These results establish BrainPET Studio as a reliable, accessible, and extensible tool for multi-site PET research, educational applications, and studies where FreeSurfer-based processing is impractical.
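
For readers unfamiliar with SUVR, the quantity is a ratio of mean tracer uptake in a target region to mean uptake in a reference region. The sketch below shows only that arithmetic on plain lists of voxel values. It is not BrainPET Studio's implementation, and the function name, region choices, and numbers are assumptions made for illustration.

```python
from statistics import mean


def suvr(target_voxels, reference_voxels):
    """Standardized uptake value ratio: mean target uptake / mean reference uptake."""
    return mean(target_voxels) / mean(reference_voxels)


# Hypothetical uptake values for a target ROI and a reference region.
target_roi = [2.0, 3.0, 4.0]
reference_region = [1.5, 1.5]
print(suvr(target_roi, reference_region))
```

In a real pipeline the voxel lists would come from an atlas mask applied to a registered PET volume, and the choice of reference region depends on the tracer.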

8

Corpus for Benchmarking Clinical Speech De-identification

Dai, H.-J.; Fang, L.-C.; Mir, T. H.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-03 health informatics 10.64898/2026.03.31.26349906 medRxiv

Top 0.1%

14.3%

Objectives: Publicly available datasets dedicated to clinical speech de-identification tasks remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories. Methods: Two publicly available English medical-domain datasets were adapted to support speech-level de-identification, including script reformulation and controlled re-recording by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with million-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen's kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs. Results: The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech. Discussion: The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and structural challenges associated with multilingual speech de-identification. Conclusion: The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset supporting automated medical speech de-identification research and facilitating future development of multilingual speech-based privacy protection systems.
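
The abstract above evaluates inter-annotator agreement with Cohen's kappa. A minimal sketch of that statistic for two annotators labelling the same items follows. How the authors aligned spans before scoring is not stated, so this assumes item-level labels, and the label names are hypothetical rather than taken from the corpus's 38 categories.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four items.
annotator_1 = ["NAME", "DATE", "NAME", "O"]
annotator_2 = ["NAME", "DATE", "O", "O"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance; the formula is undefined when chance agreement equals 1.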

9

A Manual of Procedures for the Generation of the AI-Ready and Exploratory Atlas for Diabetes Insights (AI-READI) Database.

Matthies, D. S.; Edberg, J. C.; Baxter, S. L.; Lee, A. Y.; Lee, C. S.; McGwin, G.; Owen, J. P.; Zangwill, L. M.; Owsley, C.; AI-READI Consortium

2026-04-04 endocrinology 10.64898/2026.03.30.26349552 medRxiv

Top 0.1%

14.2%

The ability to understand and affect the course of complex, multi-system diseases like diabetes has been limited by a lack of well-designed, high-quality and large multimodal datasets. The NIH Bridge2AI AI-READI project (aireadi.org) aims to address this shortfall by generating an AI-ready dataset to support AI discoveries in type 2 diabetes mellitus (T2DM). This manual of procedures provides a detailed description of the AI-READI protocol.

10

A Community Standard Multispecies Cell Atlas of the Basal Ganglia

Ecker, J. R.; Hawrylycz, M.; Lein, E.; Ren, B.; Thompson, C.; Zeng, H.; White, O.; Zhang, G.-Q.

2026-04-15 neuroscience 10.64898/2026.04.14.717814 medRxiv

Top 0.1%

14.1%

The NIH BRAIN Initiative Cell Atlas Network (BICAN) aims to generate a standardized, integrated cell atlas of the human, macaque, marmoset, and mouse brain that serves as a foundational community reference for the classification and study of brain cell types. Here we present the first major component of this effort: a cross-species, multimodal atlas of the basal ganglia, a group of subcortical nuclei central to motor control and implicated in a broad range of neurological disorders. Grounded in large-scale single-cell transcriptomic classification and integrated with epigenomic and spatial genomic modalities, this resource is enabled by coordinated cross-species sampling and harmonized analytical frameworks. It provides extensive phenotypic characterization of cell types, incorporates community-informed annotation, and establishes a highly curated, data-driven taxonomy with standardized nomenclature. The atlas is anchored to species-specific anatomical reference frameworks and linked across species through unified structural ontologies, enabling consistent cross-species comparisons. Multiple complementary datasets are mapped to this reference, including multiomic profiles and developmental trajectories aligned to adult cell states. Realization of this resource has required coordinated standards for tissue processing across human and model organisms, harmonization of donor metadata across brain banks, and the development of unified anatomical reference systems. To support these advances, BICAN has established an integrated ecosystem comprising standardized sequencing pipelines, neuroanatomically grounded data infrastructure, scalable visualization and mapping tools, and interoperable metadata standards. 
Analogous to the standardization achieved in genome science, this ecosystem provides a FAIR (findable, accessible, interoperable, and reusable) framework that enables researchers to map, compare, and interpret diverse datasets against a shared reference and associated knowledge base. The BICAN reference system is now being extended to the whole brain, with principles that are readily generalizable to other organ systems.

11

A standardized naturalistic audio stimuli database with unsupervised labeling

Al-Naji, A.; Schubotz, R. I.; Zahedi, A.

2026-04-21 neuroscience 10.64898/2026.04.16.718910 medRxiv

Top 0.1%

12.7%

Research in cognitive neuroscience has relied on simple, highly controlled stimuli due to the difficulty of developing standardized, ecologically valid stimulus sets. However, there is a consensus that using ecologically valid stimuli is imperative to generalize results beyond controlled laboratory settings. The current study introduces a naturalistic audio stimulus database, consisting of short, recognizable, and emotionally rated stimuli. To create such a database, the current study collected 291 audio files from a wide range of sources. 361 participants rated the audio clips on emotionality, arousal, and recognizability, and subsequently freely described the clips by typing what they believed each sound to be. The text responses of the participants were embedded and clustered using an unsupervised machine-learning algorithm to derive a participant-grounded organization of auditory object categories. The results indicate that the audio clips were easily recognizable, while emotionality and arousal ratings showed broad variability, making the database suitable for diverse experimental needs. Furthermore, the final database comprises 10 distinct semantic categories, providing a diverse set of auditory stimuli.

12

The Celiac Microbiome Repository (CMR): A Curated Collection of Celiac Disease Gut Microbiome Sequencing Data

Bishop, H. V.; Prendergast, P. J.; Herbold, C. W.; Ogilvie, O. J.; Dobson, R. C. J.

2026-03-31 bioinformatics 10.64898/2026.03.28.715053 medRxiv

Top 0.1%

10.2%

Celiac disease is an autoimmune condition where the gut microbiome is increasingly recognised as a key environmental factor. While high-throughput sequencing has led to a surge in celiac-related gut microbiome profiling data, these datasets remain fragmented, heterogeneous, and often lack the metadata required for large-scale integration into pooled, cross-cohort datasets. To address this, we developed the Celiac Microbiome Repository (CMR), a curated, open-access collection of celiac-related 16S rRNA gene and shotgun metagenomic sequencing datasets. We employed a systematic curation workflow to identify datasets across the NCBI Sequence Read Archive (SRA) and Scopus, followed by manual metadata extraction and direct author engagement. All 16S data was reprocessed through DADA2 and shotgun data through MetaPhlAn4 to facilitate comparison across studies. The CMR version 1.0 comprises 28 datasets containing 3,245 samples from 13 countries and 5 body sites. Our analysis reveals that while publicly available celiac microbiome samples have accumulated at a rate of approximately 140 per year, significant barriers to accessibility exist. Just 20 of 58 eligible datasets were found to have both raw data and essential metadata readily available within public archives. The repository features a dual-interface design, consisting of a GitHub backend for programmatic access and an R Shiny frontend for interactive data exploration. By providing this curated and harmonised resource, the CMR enables the research community to leverage public data for global meta-analyses and machine learning applications. Ultimately, this work provides the foundation needed to move beyond isolated, small-scale studies toward high-powered discoveries in celiac disease research. Database URLs: https://github.com/CeliacMicrobiomeRepo/celiac-repository | https://celiac.shinyapps.io/celiac-webapp

13

MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies

Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.

2026-04-10 bioinformatics 10.64898/2026.04.07.716833 medRxiv

Top 0.1%

10.2%

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.

14

Multi-Contrast MRI Inputs Enable Self-Consistent Tissue Segmentation & Robust Perivascular Space Identification

Gunter, J. L.; Preboske, G. M.; Persons, B.; Przybelski, S. A.; Schwarz, C. G.; Low, A.; Vemuri, P.; Petersen, R.; Jack, C. R.

2026-04-07 neuroscience 10.64898/2026.04.03.716409 medRxiv

Top 0.2%

8.0%

Different MRI image contrasts are designed to highlight various tissue properties, and combining them allows extension of probabilistic segmentation beyond the commonly used "gray-white-CSF" models. This work describes a fully automated method that combines T1-weighted, T2-FLAIR, and conventional T2-weighted images to provide internal consistency across prediction of tissue segmentations, including segmentation of superficial and deep gray matter, white matter hyperintensities, and MR-visible perivascular spaces. Results from 773 imaging datasets from 403 participants in the Mayo Clinic Study of Aging and Mayo Clinic Alzheimer's Disease Research Center (ADRC) are presented.

15

Quality Assurance Strategies for Brain State Characterization by MEMRI

Uselman, T. W.; Jacobs, R. E.; Bearer, E. L.

2026-04-14 neuroscience 10.64898/2026.04.10.717774 medRxiv

Top 0.2%

6.8%

Background: Manganese-enhanced magnetic resonance imaging (MEMRI) is a powerful approach for mapping brain-wide neural activity and axonal projections in vivo. Yet standardized computational frameworks for voxel-wise and atlas-based characterization of brain states across large experimental cohorts remain limited. New method: Here, we present methodological advances for preprocessing and statistical analysis of MEMRI datasets to support scalable, reproducible cohort-level analyses. Quality assurance metrics were developed to evaluate images, cohort-level anatomical alignment, and intensity normalization. Using simulated data, we optimized smoothing, effect-size, and cluster-size thresholds to balance sensitivity and specificity in voxel-wise statistical mapping. We developed InVivoSegment software to apply to our new InVivo Atlas for segmentation of MEMRI data and interpretation of brain-wide activity. Results: Quality assurance analyses established benchmarks for Mn(II)-induced signal- and contrast-to-noise evaluation, precise cohort-level alignment at 100 µm isotropic resolution, and robust intensity normalization. Balanced accuracy and Youden's J statistics were calculated from simulated true positive and noise-only intensities, which defined optimal parameters for smoothing kernel, cluster-size and effect-size thresholds during voxel-wise mapping. Segmentation of simulated data demonstrated reliable transformation of voxel-wise results into regional summaries and identified secondary thresholds that minimize noise-driven artifacts. Comparison with existing methods: Optimizing correction parameters for statistical mapping using simulated images improves voxel- and segment-wise sensitivity compared to FDR/FWE-based correction procedures.
Conclusions: These methodological advances enable scalable, reproducible, brain-wide quantification of longitudinal changes in MEMRI studies, strengthen mechanistic investigation of brain-state dynamics relevant to human health, and provide broadly applicable tools for neuroimaging analyses beyond MEMRI applications. Highlights: (1) Quantitative assurance of image quality complements visual assessment for cohort-level batch processing. (2) Optimization of parameters using simulated noise-only images with and without investigator-embedded signal for voxel-wise mapping. (3) A new software, "InVivoSegment", together with a labeled atlas, automates reliable user-friendly segmentation of voxel-wise data. (4) Methodological advances in MEMRI data processing and computational analyses support scalable voxel- and segment-wise quantification of brain-wide neural activity.
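
The Results above use balanced accuracy and Youden's J to pick thresholds. Both are simple functions of sensitivity and specificity, as the sketch below shows. The confusion-matrix counts are hypothetical and the function names are illustrative; this is not the authors' code.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(tp, fn, tn, fp)
    return (sens + spec) / 2


def youden_j(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sens, spec = sensitivity_specificity(tp, fn, tn, fp)
    return sens + spec - 1


# Hypothetical voxel counts for one candidate threshold setting.
counts = dict(tp=90, fn=10, tn=80, fp=20)
print(balanced_accuracy(**counts), youden_j(**counts))
```

One way to use these for parameter selection is to compute them for each candidate smoothing kernel or threshold on simulated data and keep the setting with the highest score. Since J equals twice the balanced accuracy minus one, the two metrics rank candidate settings identically.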

16

Large-scale automated detection of gray whales off California in panchromatic and multispectral satellite imagery.

Houegnigan, L.; Cuesta Lazaro, E.

2026-04-19 bioinformatics 10.64898/2026.04.15.718679 medRxiv

Top 0.2%

6.7%

Increasing human activities along the US west coast are of concern for populations of cetaceans and particularly for a number of large whale species that are recovering from overexploitation during the era of commercial whaling. New rapid monitoring tools, such as satellite imagery analysis powered by recent advances in artificial intelligence, have potential to provide additional broad-scale and near real-time capacities for survey and monitoring. This paper investigates and demonstrates the feasibility of automatic detection of gray whales in sub-meter satellite imagery off the coast of California, USA. Observations and statistical analysis of regional imagery allowed not only an assessment of their detectability but also the development of robust signal processing and machine learning-based solutions for automated detection. To that end, a regional dataset of 221 gray whales was created using signal processing to inform a deep-learning-based detection framework, and 20 different large neural network architectures for feature extraction followed by a support vector machine algorithm for classification were evaluated for their detection performance. Neural network backbones included 19 convolutional neural networks and 1 transformer network. The best architecture achieved an average balanced accuracy of up to 99.90%. It is also demonstrated that panchromatic imagery, in spite of the lesser amount of information provided, can be used to perform detection with a relatively high accuracy of 87.05%, allowing wider spatial and temporal coverage. Large-scale deployment of the best performing models over a broad range of regional satellite imagery resulted in the detection of 3353 gray whales, as well as opportunistic detections of humpback, blue and fin whales, in imagery dated from December 28th 2009 to March 26th 2023.
It also provided meaningful data points concerning the migration routes of gray whales within the Channel Islands and Southern California Bight. The large number of high-confidence detections indicates the capacity for a large-scale monitoring approach to support state and federal conservation policies such as gear mitigation, vessel speed reduction programs, or shipping lane redefinition that could also be expanded to other areas and for other species.

17

Validated Synthetic Data Generation from a Multicenter Spine Surgery Registry: Methodology and Benchmark

Challier, V.; Jacquemin, C.; Diebo, B.; Dehouche, N.; Denisov, A.; Cristini, J.; Campana, M.; Castelain, J.-E.; Lonjon, G.; Lafage, V.; Ghailane, S.; SpineDAO Collaborative Group

2026-04-11 health informatics 10.64898/2026.04.07.26350316 medRxiv

Top 0.2%

6.6%

Background: Synthetic data have emerged as a complementary strategy for secondary use of clinical registries, enabling data sharing without patient-level exposure. In spine surgery, multicenter data sharing is constrained by institutional governance and patient privacy regulations. Validated synthetic data generation may enable broader access to surgical outcomes data for artificial intelligence development without compromising patient confidentiality. Objective: To describe and benchmark a three-domain validated synthetic data pipeline applied to a multicenter, tokenized spine surgery registry (SpineBase), and to establish a reproducible certification framework for synthetic spine surgery datasets. Methods: We extracted 125 sacroiliac joint fusion cases from the SpineBase registry (SIBONE study, IRB-SOFCOT approval Ref. 14-2025; CNIL MR-004 Ref. 2234503 v 0). A GaussianCopula generative model was trained on 52 structured variables spanning demographics, preoperative assessments, operative details, and longitudinal outcomes at 3, 6, 12, and 24 months. Synthetic datasets of 100, 1,000, and 10,000 patients were generated. Validation followed a three-domain framework: (1) fidelity, assessed by Kolmogorov-Smirnov tests and Jensen-Shannon divergence; (2) utility, assessed by train-on-synthetic, test-on-real (TSTR) methodology; and (3) privacy, assessed by nearest-neighbor distance ratio (NNDR), membership inference attack, and k-anonymity proxy. Results: All three validation gates passed. Fidelity: mean KS p-value 0.52 (threshold >0.05). Privacy: NNDR >1.0 in 98.9% of synthetic records; membership inference AUROC 0.57. Utility: 12-month Oswestry Disability Index prediction yielded Pearson r = 0.29, consistent with expected attenuation at N = 125. A SHA-256 cryptographic hash of each certified dataset was anchored on the Solana blockchain for immutable provenance.
Conclusions: A validated, blockchain-anchored synthetic data pipeline for spine surgery registries is technically feasible and meets current publication-standard criteria for fidelity and privacy. Utility metrics scale with registry size, creating a direct incentive for multicenter data contribution. This framework provides a reproducible methodology for synthetic data certification in spine surgery research, and establishes certified synthetic datasets as a privacy-native substrate for expert-annotation pipelines -- as demonstrated in the companion Spine Reviews study.
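
One of the fidelity checks named above is Jensen-Shannon divergence between real and synthetic distributions. The sketch below computes it for two discrete distributions over the same categories, using base-2 logarithms so the result lies between 0 and 1. The abstract does not specify binning, log base, or thresholds, so those choices, along with the function names and example distributions, are assumptions.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical category frequencies for one variable in real vs synthetic data.
real_dist = [0.5, 0.3, 0.2]
synthetic_dist = [0.4, 0.4, 0.2]
print(js_divergence(real_dist, synthetic_dist))
```

Identical distributions give 0 and distributions with no overlap give 1, so lower values indicate a synthetic variable that tracks the real one more closely.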

18

Highly replicable multisite patterns of adolescent white matter maturation

Meisler, S. L.; Cieslak, M.; Bagautdinova, J.; Hendrickson, T. J.; Pandhi, T.; Chen, A. A.; Hillman, N.; Radhakrishnan, H.; Salo, T.; Feczko, E.; Weldon, K. B.; McCollum, R.; Fayzullobekova, B.; Moore, L. A.; Sisk, L.; Davatzikos, C.; Huang, H.; Avelar-Pereira, B.; Caffarra, S.; Chang, K.; Cook, P. A.; Flook, E. A.; Gomez, T.; Grotheer, M.; Hagen, M. P.; Huque, Z. M.; Karipidis, I. I.; Keller, A. S.; Kruper, J.; Luo, A. C.; Macedo, B.; Mehta, K.; Mitchell, J. L.; Pines, A. R.; Pritschet, L.; Rauland, A.; Roy, E.; Sevchik, B. L.; Shafiei, G.; Singleton, S. P.; Stone, H. L.; Sun, K. Y.; Sydnor,

2026-04-19 neuroscience 10.64898/2026.04.18.719321 medRxiv

Top 0.2%

6.5%

The Adolescent Brain Cognitive Development (ABCD) Study is the largest U.S.-based neuroimaging initiative of adolescent brain maturation. Diffusion MRI (dMRI) provides unique insights into white matter organization, yet applying advanced processing pipelines and managing technical variability across scanning environments remains challenging at scale. To address these issues, we present ABCD-BIDS Community Collection (ABCC) release 3.1.0, including a curated resource of more than 24,000 fully processed ABCD dMRI datasets. ABCC provides fully processed images, nuanced image quality metrics, advanced microstructural measures, and person-specific bundle tractography. Evaluating these rich data revealed that measures of diffusion restriction and non-Gaussianity--in particular the intracellular volume fraction from NODDI and return-to-origin probability from MAP-MRI--were highly sensitive to neurodevelopment and robust to variation in image quality. Additionally, harmonization of microstructural features markedly improved the cross-vendor generalizability of developmental effects. Together, ABCC accelerates reproducible, rigorous research on adolescent white matter development.

19

Connectome-based spatial statistics enabling large-scale population analyses of human connectome across cohorts

Li, T.; Wang, X.; Cole, M.; Sun, Z.; Jiang, Z.; Qian, X.; Gao, S.; Luo, T.; Descoteaux, M.; Stein, J. L.; Wang, X.; Nichols, T. E.; Zhang, H.; Zhang, Z.; Zhu, H.

2026-04-10 neuroscience 10.64898/2026.04.09.717492 medRxiv

Top 0.3%

6.4%

Large-scale population analyses of structural connectome organization remain challenging because of cross-subject alignment, pathway interpretability and computational burden. No widely adopted standard exists for systematic evaluation across processing methods. We developed connectome-based spatial statistics (CBSS), a scalable framework for anatomically aligned and functionally informed quantification of white-matter microstructure that yields atlas-defined pathways organized into 13 functional networks. Using data from 56,510 UK Biobank participants together with five independent lifespan cohorts, we evaluated streamline-, voxel- and network-level measures in terms of reliability, heritability, structure-function coupling, cognitive and behavioral prediction, brain aging patterns and lifespan trajectories across cohorts. The systematic evaluation workflow compares population-level white-matter representations across methods, spatial scales, tasks and datasets. The results support CBSS as a common connectome reference for large-scale, cross-cohort diffusion MRI studies.

20

Summarizing data from continuous glucose monitors using the cgmstats package

Daya, N. R.; Wang, D.; Zhang, S.; Fang, M.; Wallace, A.; Zeger, S.; Selvin, E.

2026-03-31 epidemiology 10.64898/2026.03.30.26349753 medRxiv

Top 0.3%

6.3%

In this article, we present the cgmstats package for the analysis of continuous glucose monitoring (CGM) data. The use of wearable CGMs is growing rapidly. The latest generation of CGM systems does not require fingerstick calibration, is minimally invasive, and is frequently used in research studies. CGM sensors are typically worn for up to 2 weeks and record interstitial glucose measurements every minute to every 15 minutes, depending on the sensor used. CGM systems generate hundreds of measurements per day and thousands of measurements in one person over a single wear. There is a need for tools that allow researchers to efficiently organize and summarize the wealth of data on glucose patterns produced by CGM systems. The cgmstats package generates CGM summary measures for data from a variety of CGM systems and allows the user to flexibly define ranges and generate data visualizations. In this article, we provide an overview of the cgmstats package and examples of its use. The cgmstats package supports rigorous and reproducible analyses of CGM data.
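
To make the idea of CGM summary measures with user-defined ranges concrete, here is a minimal Python sketch. It does not use or reproduce the cgmstats interface, which the abstract does not describe: the function name `summarize_cgm`, its parameters, and the default 70-180 mg/dL target range are assumptions for illustration only.

```python
from statistics import mean, stdev


def summarize_cgm(readings_mg_dl, low=70.0, high=180.0):
    """Summarize one wear period of CGM readings (mg/dL).

    `low` and `high` bound the target range and can be changed by the
    caller; the 70-180 mg/dL defaults are an illustrative assumption.
    """
    if len(readings_mg_dl) < 2:
        raise ValueError("need at least two readings to compute a standard deviation")

    n = len(readings_mg_dl)
    avg = mean(readings_mg_dl)
    sd = stdev(readings_mg_dl)  # sample standard deviation
    below = sum(1 for g in readings_mg_dl if g < low)
    above = sum(1 for g in readings_mg_dl if g > high)
    in_range = n - below - above

    return {
        "n_readings": n,
        "mean_mg_dl": avg,
        "sd_mg_dl": sd,
        "cv_percent": 100.0 * sd / avg,  # coefficient of variation
        "percent_below": 100.0 * below / n,
        "percent_in_range": 100.0 * in_range / n,
        "percent_above": 100.0 * above / n,
    }


if __name__ == "__main__":
    # Toy readings, not real sensor data.
    summary = summarize_cgm([60, 100, 150, 200])
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Passing different `low` and `high` values, for example a tighter 70-140 mg/dL window, recomputes the three percentages against that range without changing the rest of the summary.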